Get to Know Cassandra

Last week we kicked off our Meet Cassandra series – with a short introduction to meet and greet. Today, we’ll begin getting to know her by immediately get hands on with data modeling. Download Cassandra 1.2.2 and follow the below screencast to get it running.

Data modeling with Cassandra (C*) is dramatically different than with RDBMS. With RDBMS, a normalized model is created without considering the exact queries it will service, as SQL can return just about anything through joins. With C*, the data model is designed for the specific queries it will service; as new queries are introduced, the schema is adjusted accordingly. C* has no concept of joins, relationships, or foreign keys. Instead, a separate table is typically leveraged per query, and data required by multiple tables is denormalized across those tables.

/wp-content/uploads/2013/03/CassandraTwitterPic.jpg Let’s consider a simple “Twitter-clone” application, focusing on two queries:

What information does a user have in his/her profile?
What tweets has a user posted?

Let’s make a model for these queries with Cassandra Query Language (CQL) from the cqlsh prompt where our screencast ended. While CQL mimics SQL, don’t let it lure you into thinking it has all the querying power of SQL – it truly contains just a subset. We need a Keyspace to house our tables. Create one with:

CREATE KEYSPACE demo WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};

USE demo;

Don’t worry about the specifics of the above commands for now. Next, let’s create our tables:

CREATE TABLE users (
email varchar,
bio varchar,
birthday timestamp,
active boolean,
PRIMARY KEY (email)
);

CREATE TABLE tweets (
email varchar,
time_posted timestamp,
tweet varchar,
PRIMARY KEY (email, time_posted)
);

Our users table is keyed by user email, and has columns for bio, birthday, etc. We have specified a variety of data types for columns, (varchar, timestamp, …); these are but a subset all the supported types. The tweets table is keyed by both the email of the user who posted the tweet and the time the tweet was posted, and contains a column for the tweet message itself. Now let’s insert some data into the users table and read it back. Inserts are similar to SQL:

INSERT INTO users (email, bio, birthday, active) VALUES
('john.doe@bti360.com', 'BTI360 Teammate', 516513600000, true);

In case you’re wondering, timestamp fields are specified as milliseconds since epoch, hence the big numbers. Exit cqlsh with quit; then add in some sample data from this file with cqlsh --file <file>. You might see some errors regarding keyspaces/tables already created, ignore them. Re-enter cqlsh, and run SELECT * FROM users; and SELECT * FROM tweets; We should see something similar to:

USERS

email	active	bio	birthday
new.user@gmail.com	True	null	null
john.doe@bti360.com	True	BTI360 Teammate	1986-05-15 00:00:00-0400
bob@yahoo.com	False	Hi.	null

TWEETS

email	time_posted	tweet
new.user@gmail.com	2013-02-09 13:38:55-0500	What’s a tweet?
john.doe@bti360.com	2013-02-09 13:30:01-0500	VT FTW
john.doe@bti360.com	2013-02-09 13:43:48-0500	Go orange.

Our select statement for users show nulls where values were omitted for fields, and tweets shows new rows for each tweet for a user. In reality, cqlsh is lying to us. It is presenting the data in a Relational-esque view. Those nulls don’t actually exist, and neither do the extra rows.

Check out our next blog entry when we go behind the scenes and dive deeper into data modeling with Cassandra to see what sets her apart!

BTI 360

Get to Know Cassandra

Chris / March 12, 2013 in
Engineering

USERS

TWEETS

Categories

Stay in Touch

Get to Know Cassandra

Chris / March 12, 2013 in Engineering

USERS

TWEETS

Categories

Have You Met Cassandra?

Diving Deeper with Cassandra

Enjoy our Blog?

Stay in Touch

Chris / March 12, 2013 in
Engineering